{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tabular Datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we have already discovered, Elements are simple wrappers around your data that provide a semantically meaningful visual representation. HoloViews can work with a wide variety of data types, but many of them can be categorized as either:\n",
    "\n",
    "   * **Tabular:** Tables of flat columns, or\n",
    "   * **Gridded:** Array-like data on 2-dimensional or N-dimensional grids\n",
    "   \n",
    "These two general data types are explained in detail in the [Tabular Data](../user_guide/08-Tabular_Datasets.ipynb) and [Gridded Data](../user_guide/09-Gridded_Datasets.ipynb) user guides, including all the many supported formats (including Python dictionaries of NumPy arrays, pandas ``DataFrames``, dask ``DataFrames``, and xarray ``DataArrays`` and ``Datasets``). \n",
    "\n",
    "In this Getting-Started guide we provide a quick overview and introduction to two of the most flexible and powerful formats: columnar **pandas** DataFrames (in this section), and gridded **xarray** Datasets (in the next section)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tabular\n",
    "\n",
    "Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis having [**tidy data**](http://www.jeannicholashould.com/tidy-data-in-python.html) provides flexibility and simplicity. For tidy data, the **columns** of the table represent **variables** or **dimensions** and the **rows** represent **observations**. The best way to understand this format is to look at such a dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import holoviews as hv\n",
    "from holoviews import opts\n",
    "hv.extension('bokeh', 'matplotlib')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diseases = pd.read_csv('../assets/diseases.csv.gz')\n",
    "diseases.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This particular dataset was the subject of an excellent piece of visual journalism in the [Wall Street Journal](http://graphics.wsj.com/infectious-diseases-and-vaccines/#b02g20t20w15). The WSJ data details the incidence of various diseases over time, and was downloaded from the [University of Pittsburgh's Project Tycho](http://www.tycho.pitt.edu/). We can see we have 5 data columns, which each correspond either to independent variables that specify a particular measurement ('Year', 'Week', 'State'), or observed/dependent variables reporting what was then actually measured (the 'measles' or 'pertussis' incidence). \n",
    "\n",
    "Knowing the distinction between those two types of variables is crucial for doing visualizations, but unfortunately the tabular format does not declare this information. Plotting 'Week' against 'State' would not be meaningful, whereas 'measles' for each 'State' (averaging or summing across the other dimensions) would be fine, and there's no way to deduce those constraints from the tabular format.  Accordingly, we will first make a HoloViews object called a ``Dataset`` that declares the independent variables (called key dimensions or **kdims** in HoloViews) and dependent variables (called value dimensions or **vdims**) that you want to work with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vdims = [('measles', 'Measles Incidence'), ('pertussis', 'Pertussis Incidence')]\n",
    "ds = hv.Dataset(diseases, ['Year', 'State'], vdims)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we've used an optional tuple-based syntax **``(name,label)``** to specify a more meaningful description for the ``vdims``, while using the original short descriptions for the two ``kdims``.  We haven't yet specified what to do with the ``Week`` dimension, but we are only interested in yearly averages, so let's just tell HoloViews to average over all remaining dimensions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = ds.aggregate(function=np.mean)\n",
    "ds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(We'll cover aggregations like ``np.mean`` in detail later, but here the important bit is simply that the ``Week`` dimension can now be ignored.)\n",
    "\n",
    "The ``repr`` shows us both the ``kdims`` (in square brackets) and the ``vdims`` (in parentheses) of the ``Dataset``.  Because it can hold arbitrary combinations of dimensions, a ``Dataset`` is *not* immediately visualizable. There's no single clear mapping from these four dimensions onto a two-dimensional page, hence the textual representation shown above.\n",
    "\n",
    "To make this data visualizable, we'll need to provide a bit more metadata, by selecting one of the large library of Elements that can help answer the questions we want to ask about the data. Perhaps the most obvious representation of this dataset is as a ``Curve`` displaying the incidence for each year, for each state. We could pull out individual columns one by one from the original dataset, but now that we have declared information about the dimensions, the cleanest approach is to map the dimensions of our ``Dataset`` onto the dimensions of an Element using ``.to``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "layout = (ds.to(hv.Curve, 'Year', 'measles') + ds.to(hv.Curve, 'Year', 'pertussis')).cols(1)\n",
    "layout.opts(\n",
    "    opts.Curve(width=600, height=250, framewise=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we specified two ``Curve`` elements showing measles and pertussis incidence respectively (the vdims), per year (the kdim), and laid them out in a vertical column.  You'll notice that even though we specified only the short name for the value dimensions, the plot shows the longer names (\"Measles Incidence\", \"Pertussis Incidence\") that we declared on the ``Dataset``.\n",
    "\n",
    "You'll also notice that we automatically received a dropdown menu to select which ``State`` to view. Each ``Curve`` ignores unused value dimensions, because additional measurements don't affect each other, but HoloViews has to do *something* with every key dimension for every such plot.  If the ``State`` (or any other key dimension) isn't somehow plotted or aggregated over, then HoloViews has to leave choosing a value for it to the user, hence the selection widget. Other options for what to do with extra dimensions or just extra data ranges are illustrated below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting\n",
    "\n",
    "One of the most common things we might want to do is to select only a subset of the data. The ``select`` method makes this extremely easy, letting you select a single value, a list of values supplied as a list, or a range of values supplied as a tuple. Here we will use ``select`` to display the measles incidence in four states over one decade. After applying the selection, we use the ``.to`` method as shown earlier, now displaying the data as ``Bars`` indexed by 'Year' and 'State' key dimensions and displaying the 'Measles Incidence' value dimension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "states = ['New York', 'New Jersey', 'California', 'Texas']\n",
    "bars = ds.select(State=states, Year=(1980, 1990)).to(hv.Bars, ['Year', 'State'], 'measles').sort()\n",
    "bars.opts(\n",
    "    opts.Bars(width=800, height=400, tools=['hover'], xrotation=90, show_legend=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Faceting\n",
    "\n",
    "Above we already saw what happens to key dimensions that we didn't explicitly assign to the Element using the ``.to`` method: they are grouped over, popping up a set of widgets so the user can select the values to show at any one time. However, using widgets is not always the most effective way to view the data, and a ``Dataset`` lets you specify other alternatives using the ``.overlay``, ``.grid`` and ``.layout`` methods. For instance, we can lay out each state separately using ``.grid``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = ds.select(State=states, Year=(1930, 2005)).to(hv.Curve, 'Year', 'measles')\n",
    "gridspace = grouped.grid('State')\n",
    "gridspace.opts(\n",
    "    opts.Curve(width=200, color='indianred'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can take the same grouped object and ``.overlay`` the individual curves instead of laying them out in a grid:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ndoverlay = grouped.overlay('State')\n",
    "ndoverlay.opts(\n",
    "    opts.Curve(width=600, color=hv.Cycle(values=['indianred', 'slateblue', 'lightseagreen', 'coral'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These faceting methods even compose together, meaning that if we had more key dimensions we could ``.overlay`` one dimension, ``.grid`` another and have a widget for any other remaining key dimensions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregating\n",
    "\n",
    "Instead of selecting a subset of the data, another common operation supported by HoloViews is computing aggregates. When we first loaded this dataset, we aggregated over the 'Week' column to compute the mean incidence for every year, thereby reducing our data significantly. The ``aggregate`` method is therefore very useful to compute statistics from our data.\n",
    "\n",
    "A simple example using our dataset is to compute the mean and standard deviation of the Measles Incidence by ``'Year'``. We can express this simply by passing the key dimensions to aggregate over (in this case just the 'Year') along with a function and optional ``spreadfn`` to compute the statistics we want. The ``spreadfn`` will append the name of the function to the dimension name so we can reference the computed value separately. Once we have computed the aggregate, we can simply cast it to a ``Curve`` and ``ErrorBars``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "agg = ds.aggregate('Year', function=np.mean, spreadfn=np.std)\n",
    "errorbars = hv.ErrorBars(agg,vdims=['measles', 'measles_std']).iloc[::2]\n",
    "overlay =  (hv.Curve(agg) * errorbars).redim.range(measles=(0, None))\n",
    "\n",
    "overlay.opts(width=600)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this way we can summarize a multi-dimensional dataset as something that can be visualized directly, while allowing us to compute arbitrary statistics along a dimension.\n",
    "\n",
    "## Other data\n",
    "\n",
    "If you want to know more about working with tabular data, particularly when using datatypes other than pandas, have a look at the [user guide](../user_guide/07-Tabular_Datasets.ipynb). The different interfaces allow you to work with everything from simple NumPy arrays to out-of-core dataframes using dask. Dask dataframes scale to visualizations of billions of rows, when using [datashader](https://anaconda.org/jbednar/holoviews_datashader/notebook) with HoloViews to aggregate the data as needed."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
